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Abstract 

We present an approach to the estimate of the potential of mean force along a generic reaction coordinate based on maximum 
likelihood methods and path-ensemble averages in systems driven far from equilibrium. Following similar arguments, various 
free energy estimators can be recovered, all providing comparable computational accuracy. The method, applied to the unfolding 
process of the a-helix form of an alanine deca-peptide, gives results in good agreement with thermodynamic integration. 
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I. INTRODUCTION 



Estimate of free energy differences is useful for many applications including protein/ligand binding affinities and 
drug design as well as for theoretical perspectives. A rough classification of the plethora of computational methods 
devised for determining free energy differences can be based on the possibility of sampling a system at equilibrium 
or out of equilibrium. Equilibrium approaches include thermodynamic integration free energy perturbationf2] and 
Umbrella Sampling techniques [3]. Representative examples of nonequilibrium techniques are the so-called adaptive 
forcej4(| or potential[5l] bias methods. The efficiency of the latter techniques depends crucially on how fast the history- 
dependent force or potential changes in time, or in other words, how far from equilibrium the simulation is carried out. 
From this point of view, adaptive bias potential methods would be more appropriately defined as quasi-equilibrium 
techniques. 

In the context of nonequilibrium approaches [1,13], a substantially different scenario has been disclosed by JarzynskiQ 
and Crooks[9|, who introduced "truly" nonequilibrium methods for determining free energy differences. In particular 
they proposed two exact equations, referred here as Jarzynski equality and Crooks nonequilibrium work theorem, 
relating free energy differences between two thermodynamic states to the external work done on the system in 
an ensemble of nonequilibrium paths switching between the two states. In a recent paper Shirts et aZ.[lO| have 
demonstrated that the Bennett acceptance ratio[lll can be interpreted, exploiting the Crooks nonequilibrium work 
theorem, in terms of the maximum likelihood (ML) estimate of the free energy difference given a set of nonequilibrium 
work values in the forward and reverse directions. 

One of the major shortcomings of these nonequilibrium techniques is that free energy profile along a given reaction 
coordinate, i.e. the potential of mean force (PMF), is hardly available. With the two sets of forward and reverse 
nonequilibrium paths. Crooks nonequilibrium work theorem[9|] and ML methodflO] yield only the free energy differ- 
ences between the final and initial states. The Jarzynski equality, on the other hand, can in principle be used to 
calculate the PMF. However it is well-known [3, that the exponential average in the Jarzynski equality 

depends crucially on a small fraction of realizations that transiently violate the second law of thermodynamics. Since 
such "magic" realizations are very unlikely to occur among a collection of fast rate realizations, it is clear that the 
potential of mean force cannot be determined accurately by the direct application of the Jarzynski equality. 

In the present paper we demonstrate how to recover the PMF using ML estimators and path-ensemble averages 
in systems driven far from equilibrium T^]. We test the method on the unfolding process of the a- helix form of an 
alanine deca-peptide through steered molecular dynamics (MD) simulations. 

II. THEORY 

A. Description of the dynamical system and notation 

Let us consider a system that can switch between two states, A and B, characterized by different values of an 
arbitrary reaction coordinate C,, namely C,a and Cb- We denote with F (forward) any realization during which the 
reaction coordinate is forced to vary from C^a to C,b with a prescribed time schedule. Accordingly, we denote with R 
(reverse) any realization that brings the reaction coordinate from to C,a with inverted time schedule. The kind of 
computational or experimental technique used for producing the realizations is not relevant. The essential requirement 
is that the used technique furnishes the value of the work done on the system during the realizations. Suppose to 
produce a collection oinp forward realizations and a collection of nn reverse realizations, each realization being started 
from microstates {i.e., phase space points) sampled from an equilibrium distribution (equilibrium microstates of A for 
the F realizations and equilibrium microstates of B for the R realizations). Specifically, an equilibrium microstate of, 
e.g., A is simply obtained b y sa mpling the system in thermal equilibrium with a bath, the reaction coordinate being 
constrained to the value Ca [13l. Il7l|. Furthermore, we assume that all realizations are performed at a very fast rate, 
which implies that they are carried out far from equilibrium. As a consequence, the final microstates of the F and R 
realizations will not be distributed according to the equilibrium distribution of B and A, respectively. It is evident 
that the same holds true for the intermediate microstates of the F and R realizations. For example, the microstates 
characterized by a generic value Cq of the reaction coordinate obtained during a realization starting either from A {F 
realization) or from B (i? realization) will not be equilibrium microstates of the state Q characterized by the reaction 
coordinate Cg. This situation is schematically represented in Fig. [1] We now denote a generic ith F realization 
with F^'', where the superscript Ab means that the thermodynamic state corresponding to the initial microstate is 
A (first letter) and that such microstate is taken from an equilibrium ensemble of microstates (uppercase), while the 
thermodynamic state corresponding to the final microstate is B (second letter) and that such microstate belongs to 
an ensemble of microstates out of equilibrium (lowercase). Following this notation, the segments from (^a to C.Q and 
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from (^Q to (b of the Ff' realization are denoted as F ■ ' and F' , such that F^^ = F ■ + F^ . Analogously, we 
may write ^f'^ = R-^^ + -E^^"- The same symbols with no specified subscripts will be used to indicate a generic 
realization or a collection of realizations. Finally, we define the free energy difference between the states A and B as 
AFab ^Fb-Fa. 



Forward direction 
> 




FIG. 1: Schematic representation of forward and reverse realizations with the notation used in the text. 



B. Background 

Given these two collections of nonequilibrium realizations, one may recover AFab following the ML method by 
Shirts et al.^\^. Such method is based on the maximization of the overall likelihood of obtaining the series of 
measurements (specifically the work done on the system during the F and R realizations) using the free energy 
difference as variational parameter. In our case, the likelihood C of obtaining the given work measurements can 
be expressed as the joint probability of obtaining the forward measurements at the specified work values VF[F^''], 
VF[F^''], !F[F^^], times the joint probability of obtaining the reverse measurements at the specified work values 

rip- nji 

C{AFAB)^X{P{F\W[¥t'']) J]F(i? I IF[Rf ]), (1) 

i=i j=i 

where W^[F^''] and W^[R^°] are the work performed on the system during the F^** and R-^° realizations. The best 
estimate of the free energy difference AFab is the value that maximizes C(AFab), or equivalently its log function: 



d\nC{AFAB) _^ 1 ^ 



OAFab 1 + ILe: Ql3{W[Ff'']-^AFAB) 1 , ILa /3(VK[Rf »]+A_Fab) 

1=1 UR j = l ~^ riF 



= 0, (2) 



where (3 = {kBT)~^, Ub being the Boltzmann constant and T the temperature. A full derivation of above equation 
can be found in Ref. ITol . We point out that Eq. [2] has been derived starting from the Crooks nonequilibrium work 
theorem[^ [l^. This implies that the time schedules of the F and R realizations must be related by time reversal 
symmetry and that the initial microstates of the realizations must be sampled from equilibrium distributions. Note 
also that Eq. [2] is exactly equivalent to the Bennett acceptance ratio method, as can be seen by comparison to Eqs. 
12(a) and 12(b) of Ref. |lll. 
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C. Central result 



Suppose we want to determine the free energy difference I^Faq between the states A and Q using the F and R 
reaUzations introduced above. We recaU that Q is an intermediate thermodynamic state between A and B, in the 
sense that it is characterized by a reaction coordinate, Qq, which is taken arbitrarily from the path connecting C^a to 
C,B^ or viceversa (see Fig. [Ij. As explained above, this free energy difference cannot be determined simply exploiting 
our collections of F and R realizations into Eq. [21 because the segments of the R realizations generally indicated as 
R"^" do not start from equilibrium microstates of the state Q. However, had the R realizations been started from 
equilibrium microstates of Q, i.e., suppose that the R'^" realizations are available in the place of the R^" ones, then 
we could apply Eq. [2] for the calculation of /S.Faq- In the resulting equation, which is equivalent to Eq. [2] with F^**, 
R^" and Is.Fab replaced by 'Pf'^, R^° and AFyig, respectively, the second sum can be rearranged as follows 



E 



— 1 ji^ it'F 



^ ^ ^ ^ dW^O, (3) 



where S is the Dirac delta function and {6(W[R'^''] - W)) is a shorthand for n^/ J2]=i SiWlRj"] - W). We stress 
again that the initial microstates of the RP"" realizations are assumed to be sampled from equilibrium. Of course, since 
we are dealing with nonequilibrium R^" realizations, work measurements VFfR'^"] are unavailable, at least directly. 
Thus, the basic problem here is to derive the unknown quantity ((5(VF[R'5''] — W)) using somehow the overall physical 
information contained into our R^"^ realizations. 

To this aim wc take advantage of a relation due to Crooks, IGJ that establishes a correlation between a function 
of the microstate of the system determined along forward and reverse realizations and the dissipated work done on 
the system during either the forward or the reverse realizations. In particular, setting f[x] to be a function of the 
final microstate x of a forward realization and f[i] to be the same function of the initial microstate i of a reverse 
realization, the following relation holds: 

{m)j, = {f[x]e-P'^-)^, (4) 

where Wd is the work dissipated during the F realization. The subscripts F and R indicate that the ensemble averages 
are calculated on collections of forward and reverse realizations, respectively. Therefore, since the average (/[a;])_R 
is computed on the initial (equilibrium) ensemble of the reverse process, the subsequent dynamics of the system is 
irrelevant and the average equals an equilibrium average of the function f[i]. In applying Eq. 3] to our case, we 
consider the R^* realizations (see Fig. [IJ as the forward ones. This implies that the left side of Eq. |4] refers to an 
ensemble average of the equilibrium state Q. Moreover, for a given microstate of the system Xi, corresponding to the 
final microstate of the R^"^ realization, we set 

f[x,]^Smm'']-W), (5) 

where W is an arbitrary real number and R^" is a segment of the R^° realization. We remark that, given a determin- 
istic dynamical system and given a time schedule for evolving the reaction coordinate, the quantity (5(VF[R^''] — W) 
is a single value function of the microstate Xi . With this provision, the general Eq. |4] takes the following specific form 

{S {WiR'^"] -W)) = (^SiWiR""] - W) e-^^^^^^"^'^ , (6) 

where ly^iR^'] is the work dissipated in the R^* realizations. Since AFbq is unknown, VFd[R-^'] cannot be de- 
termined. However, upon division of Eq. [5] by the equality[l3 {exp {-pWdiR^"^])) = 1, and using the definition 

Wd[Rf «] = W^[Rf - AFbq, we obtain 

(6{W[Ri-]-W) e-^'^[i^'"r 
(6 {W[RQ^] ~W))= \ J^^W^^ (7) 

This equation states that the distribution of the work VK[R'5°] done on the system in a collection of nonequilibrium 
realizations switching the reaction coordinate from C^q to C,a and starting from equilibrium microstates can be recovered 
from a set of nonequilibrium realizations switching the reaction coordinate between the same values, but starting from 
nonequilibrium microstates (realizations R*" in Eq. [7]). The contribution of each work measurement T/FfR'^"] to the 
distribution must however be weighted by a factor depending on the work done on the system to produce the initial 
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nonequilibrium microstate (the factor exp{—f3W[Ilf''])/{exp{—pW['R.^''])) in Eq. [7|). We point out that, while Eq. 
m is valid for both stochastic and deterministic systems ^IQ]. the derivation of Eq. [7] provided here holds only for 
deterministic systems (we have indeed introduced this assumption when defining f[xi]; see Eq. [5]). However, it has 
been numerically proved '^T] that the relations derived in the present article (specifically Eq. fTB]) can also be applied 
successfully to Brownian dynamical systems. Finally, substituting Eq. [7] into Eq. [3] and performing the integral, we 
get 



71 f 



0. (8) 



The above equation is the central result of the present article. By means of a recursive procedure, Eq. [5] allows us 
to determine the free energy difference between the state A and an arbitrary state Q, and hence between any pair of 
states along the reaction path. We note that in Eq. [5] the physical information of both F and R realizations is used, 
albeit not at the maximum extent. In fact, while the R realizations are fully used (note that Rf ' + Rf = Rf for 
the F realizations only the segments F'^'' are actually employed. 

An analogous and symmetrical approach aimed at using the physical information contained into the full F'^^ 
realizations and into the segment R^'' of the R^" realizations allows us to recover a ML estimator to determine 

^-l3wlF^-]\ ^y- e P I ^ i 1 

/ ^ 1 + ILe: Ql3(W[Ff]-AFQB) ^ 1 , lia ^l3iW[R.f]+AFQB) ' ^ ' 

As for Eq. [51 the left sides of Eqs. [5] and [5] are strictly increasing functions of ^Faq and AFqb, respectively. The 
limits of the left side of Eq. [8] for AFaq oo and for AFaq — oo are np and —nji, respectively. The analogous 
limits of the left side of Eq. [9l i.e., AFqb oo and AFqb — oo, give the same values. The monotonic behavior of 
the left sides of Eqs. [5]and[5]and their limit values guarantee the existence of one unique root. Such root corresponds 
to the value of the free energy difference that furnishes the ML estimate of the measured/calculated data. It is finally 
straightforward to prove that both equations have Eq. [2] as special case (set the equivalence between the states Q and 
B in Eq. [5] and between the states Q and A in Eq. [5]) . 

As stated above, the overall physical information available from the F and R realizations is not exploited either in 
Eq. [8] or in Eq. [9l To tackle this fact one can however apply a ML argument to the two collections of measurements 
implied by Eqs. [5] and [SI We first notice that Eq. [5] has been derived maximizing the log function of the likelihood 
C{AFaq) 

ln/:(AF^Q) =^lnP(F | I^[Ff ']) + ^ lnP'(i? | WiRf"]). (10) 

i=i j=i 

Note that in Eq. [TU]the probability in the second sum is primed. This means that such term must be treated with the 
usual reweighting procedure (see derivation of Eq. [5]) because the realizations R*^*^ are unavailable (since they must 
start from equilibrium microstates). The first sum of Eq. [TO] can instead be treated in the standard fashion, because 
the work measurements M^[F'^''] are available from the full F realizations. Analogously, Eq. [9] has been obtained by 
maximizing the log function of C{AFqb) 

lnC{AFQB)=J2^nP'{F \ W[Ff])+J2^nP{R \ W^[Rf ]), (11) 

i=l 3=1 

where the probability in the first sum is primed for the same reasons discussed above. From a formal standpoint, Eqs. 
[TOl and [TT] deal with two independent collections of realizations, i.e., F^'' and R'*^ the former equation and F''' and 
R^« the latter one. Therefore, noting that 

AFqb = AFab - AFaq (12) 

and assuming that the free energy difference AFab is known (using, e.g., the ML estimator of Eq. [2]), we may express 
the overall likelihood of obtaining the given work measurements in both collections as the product C{AFaq)C{AFqb), 
i.e., a function of AFaq alone (or alternatively, AFqb alone). Maximization of the log function of such a product 
with respect to AFaq leads to the following equation (the same estimator would be obtained maximizing the log 
fimction of C{AFaq)C{AFqb) with respect to AFqb) 

d\nC{AFAQ) , dlnCjAFQB) _^ 

dAFAQ + OAFaq ^''^ 
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Since AFab is known, and hence independent on both AFaq and AFqb, Eq- [12] sets the condition 



= -1. (14) 
dAFAQ ^ ' 



Using Eq. [14] into Eq. [TS] leads to the relation 

d\nC{AFAQ) 9 In £( AFqb) 
OAFaq OAFqb 



0. (15) 



The left and right derivatives of Eq. [13] are exactly the left sides of Eqs. [5] and [5] respectively. Therefore, upon 
substitution of Eqs. [8] and [9] into Eq. [15] and using Eq. [12] in the resulting equation, we obtain 



*=1 ^ nr, ^ .7 = 1 ^ np 



-/9W[Ff 



1 + ILe: g/3(VK[Ff ]-AFab+AFaq) ^ ^ i ILa g/3(Vl'[Rf'']+AFAs-AFAQ) ~ ^' ^^^^ 
k—1 ~ nn ^—1 np 



Remember that in the equation above, the quantity AFab must be predetermined via Eq. [2] The left side of Eq. [16] 
is an increasing function in AFaq and the limits for AFaq — *■ c» and for AFaq — *■ — oo have opposite signs, being 
nR + np and — np, respectively. Again, this guarantees the existence of one unique root in the Eq. 1161 

If from the one hand the ML estimator of Eq. [Tn]has the advantage of using the full physical information contained 
into our sets of work measurements, on the other hand it contains the free energy difference AFab that should be 
determined independently. This implies that the error on the estimate of AFab sums to that on AFaq ■ The other 
derived ML estimators, i.e. Eqs. [5] and [5] do not suffer of such a shortcoming. The disadvantage is however that 
these ML estimators do not employ completely the physical information of the measurements in our hands. 



III. NUMERICAL TESTS: TECHNICAL DETAILS 



The ML estimators of Eqs. [5] [1] and [TC] have been applied to compute the PMF for the unfolding process of the 
a-helix form of an alanine deca-peptide (Aio) at finite temperature. Following Ref. [l3, we have used steered MD 
simulations as a device for the numerical experiments, taking the end-to-end distance of Aio as reaction coordinate 
In particular C corresponds to the distance between the N atom of the N-terminus amino-acid (constrained to 
a fixed position) and the N atom of the C-terminus amino-acid (constrained to move along a fixed direction). The 
values of C in the folded and unfolded states of Aio are assumedflT^ to be 15.5 and 31.5 A, respectively. Moreover 
we have arbitrarily assumed the unfolding process of Aio as the forward (F) one. In the context of our notation 
(see Sec. Ill A[) . we therefore set (a = 15.5 A and — 31.5 A. It should be noted that in general the end-to-end 
distance does not determine uniquely the configurational state of polypeptides. However, in the specific case of Aio, 
the equilibrium distribution at C = Ca corresponds to an ensemble of microstates tightly peaked around the a-helix 
form, as for this end-to-end distance alternative structures are virtually impossible. The same holds true for the state 
corresponding to C = Cb, which basically represents an almost fully elongated configuration of the peptide. This 
implies that these two thermodynamic states can be effectively sampled using relatively few microstates obtained 
from equilibrium MD simulations at the given C values. The starting microstates for the F and R realizations have 
been randomly picked (every 5 ps) from two standard MD simulations constraining C to (^a and to , respectively, by 
means of a stiff harmonic potential (force constant equal to 800 kcal mol~^ A^^). In both equilibrium MD simulations 
and in the subsequent steered MD simulations, constant temperature (300 K) has been enforced using a Nose-Hoover 
thermostat [l^ [l^. Force field has been taken from Ref. 2(1. Given the limited size of the sample, no cutoff radius 
has been imposed to the atomic pair interactions and no periodic boundary conditions have been applied. 

For each type of process, F and R, we have generated lO'' realizations guiding ( from (a to (b {F realizations) or 
from Cb to (^a (R realizations) using a harmonic potential dependent on time: 

ni) = ^[C-A(i)]2, (17) 

where the force constant k is reported above. The time-dependence of the steering parameter X{t) determines the 
pulling speed of the nonequilibrium realizations and in general their time schedule. In our case, A(t) varies linearly 
with the time, i.e. X(t) = + At for the F realizations and A(t) = Cb — for the R ones. The work measurement 
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at a given instant f of a realization is calculated integrating the partial derivative of V{t) with respect to time from 
the time zero to the time t. Six series of R/F work measurements differing only in the pulling speed A have been 
performed (A = 80, 160, 320, 533, 800, and 1600 A ns^i). 



IV. NUMERICAL TESTS: RESULTS 



In Fig. [2]we report a comparison between the PMF calculated using the ML estimators of Eqs. [8l [9l and [TBI and the 
exact PMF recovered through thermodynamic integration[l|. To show the correctness of the estimators numerically, 




C (A) 



FIG. 2: PMF of Aio as a function of the reaction coordinate (end-to-end distance). Squares: Eq. [S] triangles: Eq. [O] circles: 
Eq. 1161 solid lines: thermodynamic integration (TI). The PMF profiles from ML estimators are calculated using F and R 
realizations performed with pulling speed of 80 A ns^^. For the sake of clarity the curves are up-shifted. 

in the figure we have drawn the PMF profiles obtained using the slowest pulling speed, i.e. 80 A ns^'"'^. It is noticeable 
the all ML estimators provide an almost perfect agreement with thermodynamic integration. The relevance of this 
result is enforced to the light of early PMF calculations [l3| on the unfolding process of Aig. Free energy estimators, 
such as Jarzynski equality and second order cumulant expansion [ll], using only a slightly faster pulling speed (100 
A ns""'^), give a much worse accuracy than the methods proposed here. This can be more strictly verified comparing 
Fig. 5a of Ref. [l^with the PMF estimates determined using Eqs. l8l [9l and [TBI with pulling speed of 160 A ns"^ (see 
supplementary material). A more comprehensive view of the performances of the ML estimators of Eqs. [8l [9l and [TBI 
is gained by the root mean square deviation of the estimated PMF curves from the exact one: 

1 

2 

(18) 

where FuhiCi) is the value of the PMF a.t ( — Q calculated using one of our ML estimators and F^iiCi) is the 
corresponding exact value determined by thermodynamic integration. In our calculations, the reaction coordinate 
is defined in steps of 0.4 A, i.e. Ci = = 15.5, C2 = 15.9, Cs = 20.3, = Cb = 31.5 A, where N = 41. 

The value of a for the various approaches has been calculated after determining the additive constant of Fml{C) via 
a least squares fitting to T'ti{Q. The value of a obtained from the considered ML estimators for different pulling 
speeds is reported in Fig. |3) The worsening of the accuracy of the ML estimators by increasing the pulling speed of 
the realizations is expected on the basis of statistical reasons. The remarkable result is that all ML estimators have 
comparable accuracy independing on the pulling speed. Moreover, Eq. |9]gives the best accuracy for all pulling speeds 
except for 533 A ns~^, while Eq. 1161 gives systematically an accuracy which is in between those obtained from Eqs. 
|5] and 121 These facts suggest that the formal asymmetry of the ML estimators of Eqs. E] and [5] (see discussion in Sec. 
Ill C|) in comparison to the formal symmetry of Eq. [16] may be relevant in the choice of the most accurate approach. 
In fact, it is known [Tsj that, for a given reaction path of a system, the use of a set of forward realizations in the 
framework of exponential averages for determining the free energy difference between two states may give different 
variance and bias with respect to the same estimate performed using the reverse realizations. We do not exclude that 
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i=l 
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Pulling speed (A / ns) 

FIG. 3: G value (Eq. llSp as a function of the pulling speed for various ML estimators. Squares: Eq. |8l triangles: Eq. |9l circles: 
Eq. 1161 The lines are drawn as a guide for eyes. 

this fact might be related to our observations of Fig. [3] discussed above. In such a case the ML estimator of Eq. [16] 
would be the more appropriate without prior knowledge on the system. 

V. CONCLUSIONS 

We have presented a method for determining the PMF along a given reaction coordinate which is based on ML 
methods and path-ensemble averages in systems driven far from equilibrium. The method has been applied to 
computer experiments on the unfolding process of the a-helix form of an alanine deca-peptide using nonequilibrium 
realizations with various pulling speeds. The estimated PMF is in fair agreement with thermodynamic integration. 
A formula for the variance of PMF estimates generated using this method is still unavailable, though its derivation 
appears straightforward following the guidelines reported in the present article and in the work by Shirts et aZ.[lol|. 
We plan to report on this issue in a forthcoming contribution. 
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